(54) Multi-point voice conferencing system over a wide area network



(57) An interactive network system (100) communicates speech and associated information among a plurality of participants at different sites (104, 106). An example of the associated information is lip synch image information related to the speech. The system contains a speech server (110) for managing data streams sent by the participants. Each participant uses a multimedia computer (114) and a modem (122) to connect to the network. Because many modems have a low bit rate, it is important to compress the speech and associated information. The server (110) receives the data streams from at least two participants and contains means (200) for combining these data streams into a single data stream having a bit rate that can be handled by the modem of the third participant. As a result, a plurality of participants can conduct speech and image communication using the network.




FIG. 1

Description

Field of the Invention 

The present invention relates to wide-area network communication, and in particular, to interactive communication of speech and related information over a wide-area network.

Background of the Invention

Computer technology went through major developments during the past several years. The first development is the availability of low cost yet powerful personal computers. The reduction in cost makes computers affordable to a large number of people. As a result, the number of computers increases exponentially. The processing power of these computers is more than that of mainframe computers that existed about ten years ago. Further, these computers typically come with modems, sound cards, high resolution video boards, etc., which allow them to process multimedia information, such as speech and images.

The second development is the popularity of a wide-area network called the Internet. The Internet is currently the largest computer network in existence. It is a worldwide interconnection of millions of computers, from low-end personal computers to high-end mainframes. The Internet grew out of work funded in the 1960s by the U.S. Defense Department's Advanced Research Projects Agency. For a long time, the Internet was used by researchers in universities and national laboratories to share information. As the existence of the Internet became more widely known, many users outside of the academic/research community (e.g., employees of large corporations) started to use the Internet to carry electronic mail. In 1989, a wide-area information system known as the World Wide Web (the "Web") was developed. The Web is a wide-area hypermedia information retrieval system aimed at giving universal access to a large universe of documents. A user can use software (called a browser) to retrieve Web documents (typically displayed in graphic form) and navigate the Web using simple commands and popular tools such as point-and-click. Because the user does not have to be technically trained and the browser is pleasant to use, it has the potential of opening up the Internet to the masses. Consequently, many communication companies have developed hardware and software products which allow people to use their computers to access the Internet.

Because of these developments, many people have the resources to electronically communicate with other people using the Internet. Currently, most of the communication involves text (e.g., electronic mail) and graphics (e.g., Web documents). Further, the mode of communication is passive, i.e., the information can be read or displayed by recipients a long time (e.g., hours or days) after its creation.

It is known that human beings enjoy interaction with other people. It is also known that speech and the facial expression associated with speech are powerful communication tools. Thus, it is desirable to use the Internet to interactively communicate speech and associated facial expression. Currently, there is no product that can efficiently achieve this mode of communication.

Summary of the Invention 

The present invention can be used in an interactive network system for communicating speech and associated information among a plurality of participants at different sites. Each participant uses a multimedia computer to connect to the network. The multimedia computer contains a microphone, at least one loudspeaker, a display device and a network accessing device (such as a modem). Speech processing software executes in the computer while an interactive speech session takes place. The participant can speak into the microphone. The software encodes the speech and associated data, and sends the data to a speech server using the network accessing device. The speech server can accumulate speech data from one or more participants at different sites, and deliver combined speech data to a destination site. The software in the multimedia personal computer can decode information received from the server, and reproduce the speech through its loudspeaker.

Many modems have a low communication rate (i.e., bits-per-second). Thus, it is important to compress the speech and associated data so that it can be handled by the modems. The compression method of the present invention takes into account the characteristics of speech so as to be able to communicate speech within the communication rate of the modem.

One aspect of the compression method is a novel method to obtain the acoustic characteristics of the echo path and perform echo cancellation. The system can also detect silence. During a silence period, there is no need to transmit speech information. Thus, the ability to accurately detect silence can increase the compression ratio. This method involves determining residual echo energy, and using this energy in the determination of silence.

The echo energy is an acoustic characteristic of the user site. The present invention also involves a novel method of measuring and calibrating the acoustic characteristics of the user site.

The present invention also uses novel methods to compress speech data by using a combination of vector and scalar quantization of linear predictive coding parameters, multi-pulse excitation parameters, pulse position coding, and pulse amplitude coding, either individually or in combination. Further, speech compressed using these methods can be decompressed, thereby recovering the original speech.

The present invention also involves novel and computationally efficient lip synching methods. It is observed that the lip positions in human beings can be determined to a large extent by the first two peaks, known as formant frequencies, of the short-term spectra of speech and the short-term energy of the speech signal. Further, the variables associated with the lip positions are highly correlated. The present invention exploits these relationships to code the lip positions using a very small number of bits. These bits are sent to a destination site (via the server). The computer at the destination site can draw a face with appropriate lip positions on the display device.

Brief Description of the Drawings

Fig. 1 is a drawing showing a multi-point voice conferencing system of the present invention.

Fig. 2A is a block diagram of an encoder of the present invention.

Fig. 2B is a block diagram of a decoder of the present invention.

Fig. 2C is a schematic diagram of a participant site of the present invention.

Fig. 3 is a block diagram of an arrangement for measuring acoustic characteristics in accordance with the present invention.

Detailed Description of the Invention

The present invention comprises a novel voice conferencing system and associated methods. The following description is presented to enable any person skilled in the art to make and use the invention. Descriptions of specific applications are provided only as examples. Various modifications to the preferred embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other embodiments and applications without departing from the spirit and scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown, but is to be accorded the widest scope consistent with the principles and features disclosed herein.

Fig. 1 is a drawing showing a multi-point voice conferencing system 100 of the present invention. System 100 comprises a plurality of user stations, such as stations 104-106, and a speech server 110. User stations 104-106 and speech server 110 are connected to a data network such as the Internet 112.

The structures of the user stations are similar. Thus, only one of the stations (for example, station 104) will be described in detail here. Station 104 comprises a desktop computer 114, a display device 116, a microphone 118, two loudspeakers 119 and 120, and a modem 122. Desktop computer 114 contains software (not shown) which provides a connection to Internet 112 using modem 122. A person in station 104 can talk into microphone 118. The speech is then processed by software 125 of the present invention executing in desktop computer 114. The processed speech is then sent to server 110 via Internet 112. Software 125 can also process speech information received from server 110 and send the output to loudspeakers 119 and 120. Two loudspeakers are used in the present embodiment to generate stereo effects. It should be appreciated that one loudspeaker could be used if the stereo effect is not needed.

In this embodiment, all speech is sent by the user stations to speech server 110, which then directs the speech to its destination. In addition to speech, lip synching information is also sent and received by the user stations (again routed through server 110).

One aspect of the present invention is methods to reduce the bandwidth requirement for speech and lip-synching information transmission. An important factor affecting the performance of system 100 is the speed of data transmission between user stations and the server. Currently, the transmission speed of a modem on a dial-up telephone line is around 30 kilobits per second (kbps). Even when an ISDN (Integrated Services Digital Network) line is used, the transmission speed is about 128 kbps. On the other hand, the transmission of uncompressed speech alone requires a higher bandwidth than can be supported using current telephone lines. If it is necessary to transmit other information (e.g., control data and lip-synching information), the bandwidth requirement is much higher. Consequently, there is a need to reduce the bandwidth requirement.
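The bandwidth gap can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes the 8 kHz, 16-bit input format and the roughly 12,800 bits-per-second encoder output quoted elsewhere in this specification, and the constant and function names are hypothetical.

```python
# Figures assumed from this specification; names are illustrative only.
SAMPLE_RATE_HZ = 8_000     # speech sampling rate
BITS_PER_SAMPLE = 16       # sample width
MODEM_BPS = 30_000         # approximate dial-up modem rate
ENCODER_BPS = 12_800       # approximate encoder output rate


def uncompressed_bps(sample_rate_hz, bits_per_sample):
    """Bit rate of raw, single-channel sampled speech."""
    return sample_rate_hz * bits_per_sample


raw_bps = uncompressed_bps(SAMPLE_RATE_HZ, BITS_PER_SAMPLE)
fits_modem = raw_bps <= MODEM_BPS           # raw speech vs. modem capacity
compression_ratio = raw_bps / ENCODER_BPS   # reduction the encoder must achieve

print(raw_bps, fits_modem, compression_ratio)
```

Under these assumptions raw speech needs several times the modem rate, which is why the encoder has to compress by roughly an order of magnitude before any lip-synching or control data is added.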

Software 125 contains an encoder and a decoder. Fig. 2A is a block diagram of an encoder 200 in accordance with the present invention. Various blocks of encoder 200 will be discussed in detail in this specification. Encoder 200 accepts speech input digitized at 8 kHz and a width of 16 bits per sample. The DC component of the digitized speech is removed by a DC removal block 204. Encoder 200 also contains a silence detector 208. When a period of silence is detected, there is no need to transmit speech information. Encoder 200 contains an automatic gain control 210 so as to allow speech having a wide dynamic range to be adequately processed by the system. Encoder 200 contains a voice morpher block 214 for allowing a change of the characteristics of the speech. The speech data then goes to a linear predictive coding (LPC) analysis block 222 to generate a plurality of LPC parameters. The parameters pass through a two-stage vector quantization block 224. The result is delivered to a scalar quantization block 226 and a lip information extraction block 250. The resulting parameters from block 226 are delivered to a multi-pulse excitation parameters generator 228, a robot/whisper excitation generator 230, and a bit stream encoder 232. The output of robot/whisper excitation generator 230 is also delivered to bit stream encoder 232. The output of multi-pulse excitation parameter generator 228 is coupled to a pulse-position coder 236 and a pulse-amplitude coder 238. The outputs of coders 236 and 238 are delivered to bit stream encoder 232.

The output of lip information extractor 250 is delivered to a lip filter 252, which is connected to a lip information coder 256. The result of lip information coder 256 is sent to bit stream encoder 232.

As explained in more detail below, the input speech (digitized at 8 kHz and having a width of 16 bits per sample) can be compressed such that the output of bit stream encoder 232 is approximately 12,800 bits-per-second. This data stream is sent to speech server 110 via the Internet.

Fig. 2B is a block diagram of a decoder 300 which can decode bit streams received from speech server 110. Decoder 300 contains a bit stream decoder 304 which is able to recover information from the bits sent by speech server 110. The information is sent to a lip information decoder 306, which generates lip position information for a lip synch program that generates graphics of lips. The information from bit stream decoder 304 is also sent to an LPC parameter decoder 308, a pulse amplitude decoder 310, a pulse location decoder 312, and a whisper/robot excitation decoder 314. The outputs of these decoders are sent to an LPC synthesizer 320 which generates digital speech signals. The digital speech signal is sent to an adaptive post filter 322, which also accepts information from LPC parameter decoder 308. The output of adaptive post filter 322 is sent to an echo filter 324 and then to a bandpass filter 326. The result is decompressed stereo speech which is transformed into acoustic sound by loudspeakers 119 and 120.

Fig. 2C is a schematic diagram of a participant site 350 of the present invention. A participant 352 speaks into a microphone 354, which transforms the sound signal into an electrical signal. Typically, the site contains background noise (e.g., from a radio). The electrical signal is combined with signals from an adaptive echo canceller 362 in a multimedia personal computer 360. The result is almost echo-free speech. This signal is sent to an adaptive silence detection and software automatic gain control (AGC) module 364. The output of this module is sent to a lip-synching and speech codec 366. Codec 366 incorporates the ability to disguise voice (i.e., voice morph). Codec 366 compresses and encodes the speech data, and sends the data to speech server 110 through a bidirectional communication link 368.

Codec 366 also includes software to decode and decompress speech data received from speech server 110. After decoding and decompressing, the data is sent to a wave mixer 370, which also accepts data from an ambient sound generator 372. The signal generated by wave mixer 370 is sent to adaptive echo canceller 362 and two loudspeakers 375 and 377.

Determination of the Acoustic Characteristics of a User Station

When computer 114 in user station 104 is first turned on, the acoustic characteristics of user station 104 need to be determined. This involves the determination of the bulk delay and the acoustic transfer function of user station 104.

Fig. 3 is a block diagram of an arrangement capable of measuring the acoustic characteristics of user station 104. Like elements in Figs. 1 and 3 share like reference numerals. This arrangement comprises an audio transmitter block 130 and an audio receiver block 140. Block 130 includes loudspeaker 119, which generates an audio signal in response to the magnitude of an analog electrical signal applied thereto. The analog signal is derived from the content of a digital play buffer 134. Bits in buffer 134 are shifted to a digital-to-analog converter (not shown) at a predetermined rate. Thus, the audio signal generated by loudspeaker 119 is determined by the bits in buffer 134. Audio transmitter block 130 also contains a signal generation block 132 for generating bits for storing into buffer 134 so as to cause loudspeaker 119 to generate audio signals having a specified relationship between amplitude and time.

Audio receiver block 140 contains microphone 118 for receiving the audio signals generated by loudspeaker 119. The received audio signals are converted to digital form by an analog-to-digital converter (not shown). The digital information is shifted to a record buffer 142 at a predetermined rate. A signal analysis block 144 analyzes the digital information in buffer 142 to generate a relationship between the amplitude of the received audio signal and time.

One aspect of the present invention is a simple method to determine the acoustic characteristics of user station 104. This method involves the following steps:

(1) measure the bulk delay between buffer 134 and buffer 142 through loudspeaker 119;

(2) measure the transfer function between loudspeaker 119 and microphone 118;

(3) measure the bulk delay between buffer 134 and buffer 142 through loudspeaker 120; and

(4) measure the transfer function between loudspeaker 120 and microphone 118.

The acoustic characteristics of user station 104 may change over time during a virtual world session. Consequently, the acoustic system needs to be "re-trained" at predetermined times. One aspect of the present invention is a non-intrusive way to re-train the acoustic system.

Measuring the Bulk Delay 

The measurement involves (i) generating a chirp signal by signal generation block 132, (ii) filling buffer 134 with bits corresponding to the chirp signal, and (iii) transmitting the acoustic signal of the chirp signal by loudspeaker 119. The signal received by microphone 118 is then analyzed by signal analysis block 144.

The chirp signal is defined as

[equation illegible in source]

and 0 otherwise. N is the length of the chirp signal.

The signal received by the microphone is processed by a matched filter, which is a digital finite impulse response (FIR) filter, with an impulse response of

[equation illegible in source]

and 0 otherwise. N is the length of the chirp signal defined above.

The output of the matched filter is

[equation illegible in source]

where k(n) is the signal recorded into buffer 142. Let T be the instant at which the matched filter output has maximum amplitude; then the bulk delay for the channel is estimated as (T-N). In the current implementation, the value of N used is 512. The expected bulk delay range is 1,000 to 3,000 samples.
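The matched-filter delay estimate can be sketched as follows. This is a minimal illustration, not the implementation of the specification: the chirp shape (a linear frequency sweep), its parameters, and all function names are assumptions, since the original chirp equation is not available here. With 0-based sample indices the correlation peak falls at delay + N - 1, which corresponds to the (T-N) rule above up to the indexing convention.

```python
import math


def make_chirp(n_samples, f0=0.01, f1=0.25):
    """Linear chirp sweeping f0..f1 cycles/sample (assumed shape)."""
    rate = (f1 - f0) / n_samples
    return [math.sin(2 * math.pi * (f0 * n + 0.5 * rate * n * n))
            for n in range(n_samples)]


def matched_filter_peak(recorded, chirp):
    """Index of the largest-magnitude matched-filter output.

    The matched filter's impulse response is the time-reversed chirp, so
    output sample t is the correlation of the chirp with the N recorded
    samples that end at index t.
    """
    n = len(chirp)
    best_t, best_val = 0, 0.0
    for t in range(n - 1, len(recorded)):
        start = t - (n - 1)
        acc = sum(chirp[k] * recorded[start + k] for k in range(n))
        if abs(acc) > best_val:
            best_t, best_val = t, abs(acc)
    return best_t


def estimate_bulk_delay(recorded, chirp):
    """Bulk delay in samples, using 0-based indices."""
    return matched_filter_peak(recorded, chirp) - (len(chirp) - 1)


# Usage: a 512-sample chirp delayed by 1200 samples and attenuated.
chirp = make_chirp(512)
recorded = [0.0] * 1200 + [0.5 * c for c in chirp] + [0.0] * 100
print(estimate_bulk_delay(recorded, chirp))
```

A direct correlation loop is used for clarity; for long recordings an FFT-based correlation would be the usual choice.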

A similar bulk delay of the loudspeaker 120 and microphone 118 pair is obtained by the same method.

Measuring the Transfer Function

The measurement involves generating a white-noise signal by signal generation block 132 and transmitting its associated acoustic signal by loudspeaker 119. Then, by using the signal received by microphone 118 and the white noise transmitted, the impulse response of the echo channel (from loudspeaker 119 to microphone 118) is determined as follows. Let x(n), n = 0, 1, ..., (L-1) be the white noise sequence, y(n) be the signal received by the microphone, h(n), n = 0, 1, ..., (M-1) be the impulse response of the echo channel and B be the bulk delay of the channel. Then

$$y(n) = \sum_{k=0}^{M-1} x(n - B - k)\, h(k) + w(n)$$

where w(n) is the background noise in the system. A value of M = 330 is sufficient to perform good echo cancellation. Using the least-squares estimation technique, the impulse response of the channel can be estimated by solving the matrix equation

$$R_{xx} H = R_{yx}$$

where $R_{xx}$ is the (M by M) auto-correlation matrix of the white noise, $R_{yx}$ is an (M by 1) cross-correlation vector and $H = [h(0)\ h(1)\ \ldots\ h(M-1)]^T$ is the echo filter impulse response vector. $R_{xx}$ is computed as





EP 0 779 732 A2 



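$$R_{xx}(i, j) = \sum_{n} x(n - i)\, x(n - j), \quad \text{for } i = 0, 1, \ldots, (M-1) \text{ and } j = 0, 1, \ldots, (M-1).$$

$R_{yx}$ is computed as

$$R_{yx}(i) = \sum_{n} x(n)\, y(n + B + i), \quad \text{for } i = 0, 1, \ldots, (M-1).$$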



A similar calculation is performed to obtain the characteristics of the echo channel between the loudspeaker 120 and the microphone 118 pair. It is preferable to use a value of L = 10*M to obtain an accurate estimate of the echo channel filter characteristics.
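The least-squares estimate can be sketched in a few lines. This is an illustration under stated assumptions rather than the implementation of the specification: the noise sequence is treated as zero outside its L samples, the correlation sums run over all available samples, a small M is used so that a plain Gaussian elimination suffices, and every function name is hypothetical.

```python
import random


def solve_linear(a, b):
    """Solve a * h = b by Gaussian elimination with partial pivoting."""
    m = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, m):
            f = aug[r][col] / aug[col][col]
            for c in range(col, m + 1):
                aug[r][c] -= f * aug[col][c]
    h = [0.0] * m
    for r in range(m - 1, -1, -1):
        s = sum(aug[r][c] * h[c] for c in range(r + 1, m))
        h[r] = (aug[r][m] - s) / aug[r][r]
    return h


def estimate_echo_response(x, y, bulk_delay, m):
    """Estimate h(0..m-1) from played noise x and recording y."""
    length = len(x)
    # r[lag] = sum_n x(n) x(n + lag), with x taken as zero outside 0..L-1
    r = [sum(x[n] * x[n + lag] for n in range(length - lag))
         for lag in range(m)]
    r_xx = [[r[abs(i - j)] for j in range(m)] for i in range(m)]
    r_yx = [sum(x[n] * y[n + bulk_delay + i]
                for n in range(length) if n + bulk_delay + i < len(y))
            for i in range(m)]
    return solve_linear(r_xx, r_yx)


def simulate_recording(x, h, bulk_delay):
    """Noise-free echo channel: y(n) = sum_k x(n - B - k) h(k)."""
    out = [0.0] * (len(x) + bulk_delay + len(h))
    for n, xn in enumerate(x):
        for k, hk in enumerate(h):
            out[n + bulk_delay + k] += xn * hk
    return out


# Usage with a hypothetical 4-tap channel and L = 10 * M noise samples.
rng = random.Random(1)
true_h = [0.5, -0.3, 0.2, 0.1]
noise = [rng.uniform(-1.0, 1.0) for _ in range(10 * len(true_h))]
recording = simulate_recording(noise, true_h, bulk_delay=7)
print(estimate_echo_response(noise, recording, 7, len(true_h)))
```

With background noise present the recovered taps are only approximate, which is one reason a longer noise sequence gives a better estimate.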

Quick training

Quick training is performed if the user perceives that the echo cancellation is not yielding satisfactory results or the system detects the same. Quick training can be performed to adjust the bulk delay and/or the echo channel gain of user station 104 when these have changed as a consequence of altering the speaker volume control. It can also detect whether full training is needed.

The quick training involves (i) loudspeaker 119 transmitting an acoustic signal x(n) = G*h(M - 1 - n) for n = 0, 1, ..., (M - 1) and 0 otherwise, and (ii) microphone 118 receiving the acoustic signal, y(n). The value of G is chosen to be

$$G = 30000 / \max(|h(n)|)$$

The symbol T is used to designate the instant at which this signal (i.e., y(n)) peaks. First it is determined whether full training is needed. If the echo channel characteristics have not deviated much from the previous calculations, the shape of the received signal (i.e., y(n)) around its peak (i.e., around n = T) will be similar to that of the echo channel auto-correlation function obtained using the relation

[equation illegible in source]

around k = 0. Therefore, if

[expression illegible in source]

is greater than a certain threshold, then it will be determined that the echo path impulse response has changed substantially and a full training is performed. The value of g is obtained using the equation

[equation illegible in source]

When full training is not required, the bulk delay and gain of the echo filter are updated to new values, (T-M) and [expression illegible in source].

The length and strength of the signal used for retraining are less than those of the signals used during full training. Thus, the quick training is generally not noticeable by people in user station 104. The quick training is performed for both echo channels.

Adaptive Echo Cancellation

Let $h_l(n)$, 0 ≤ n < M, be the impulse response of the echo path from the left speaker to the microphone and $h_r(n)$, 0 ≤ n < M, be the impulse response of the echo path from the right speaker to the microphone. Furthermore, let $B_l$ and $B_r$ be the respective bulk delays. Let $x_l(n)$ and $x_r(n)$ be the speech samples being fed to the left and right speakers. Then, the echo cancellation is performed by subtracting

$$\sum_{k=0}^{M-1} h_l(k)\, x_l(n - B_l - k) + \sum_{k=0}^{M-1} h_r(k)\, x_r(n - B_r - k)$$

from the signal recorded by the microphone. This computation can be implemented using the fast Fourier transform algorithm or directly.
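A direct (time-domain) form of this subtraction can be sketched as below. The function names and the toy signals are hypothetical, and the echo model is the delayed FIR convolution used in the transfer-function measurement; an FFT-based convolution would replace the inner loop in a faster variant.

```python
def echo_estimate(h, bulk_delay, speaker, n):
    """Predicted echo at microphone sample n from one loudspeaker feed."""
    total = 0.0
    for k, hk in enumerate(h):
        idx = n - bulk_delay - k
        if 0 <= idx < len(speaker):
            total += hk * speaker[idx]
    return total


def cancel_echo(mic, h_left, b_left, x_left, h_right, b_right, x_right):
    """Subtract the predicted left and right echoes from the recording."""
    return [m
            - echo_estimate(h_left, b_left, x_left, n)
            - echo_estimate(h_right, b_right, x_right, n)
            for n, m in enumerate(mic)]


# Usage: near-end speech plus echoes of two single-sample loudspeaker bursts.
near_end = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x_left = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
x_right = [0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
mic = [1.0, 2.5, 3.25, 4.4, 5.0, 6.0]   # near_end + left echo + right echo
cleaned = cancel_echo(mic, [0.5, 0.25], 1, x_left, [0.4], 2, x_right)
print(cleaned)
```

In this toy case the channel estimates are exact, so the near-end samples are recovered; with real transducers a residual echo remains, which motivates the silence detection described next.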

Adaptive Silence Detection

Due to non-linearity in the transducers (speakers and microphone), it is impossible to achieve perfect echo cancellation. This residual echo can be perceived by the listeners when the speaker is not speaking. Therefore, to eliminate echo when the near-end speech is absent, silence detection is performed and no speech packets are transmitted during the silence periods (i.e., when the near-end speaker is not speaking). By not transmitting speech information during the period of time when there is silence, the bandwidth requirement of the system is further reduced. This additional bandwidth can be used to transmit other information (e.g., graphics and/or control information).

In a typical environment, silence is not the same as no measurable audio signal. This is because there is always some background noise and (sometimes) residual echo. Thus, microphone 118 generally receives some audio signal, even when there is no speech. The equation of audio input to microphone 118 is given by:

$$A_n = S_n + E_n + B_n$$

where $A_n$ is the amplitude of sound received by microphone 118, $S_n$ is the contribution due to speech, $E_n$ is the contribution due to the residual echo, and $B_n$ is the contribution due to background noise.



The present invention makes use of the fact that human speech contains periods of silence. Thus, the audio signal corresponds to background noise and echo during these periods.

In one embodiment of the present invention, microphone 118 monitors the short-term signal energy (computed every 20 msecs using a block of [illegible] msec of signal) in a time period of one second. The segment having the lowest energy is assumed to be in a period of silence (i.e., $S_n = E_n = 0$). Using the echo signal computed as described earlier, the expected echo energy for a given time period can be easily computed. The residual echo energy for the given period is estimated to be equal to 0.1 times the expected echo energy. This assumes a conservative echo cancellation of 10 dB. Since $S_n$, $E_n$, and $B_n$ are generated independently, the energy in the signal $A_n$ can be assumed to be equal to the sum of the energy in each of the three component signals. In other words, $E_s = E_a - E_e - E_b$, where $E_s$ is the speech energy, $E_a$ is the energy in recorded speech, $E_e$ is the residual echo energy and $E_b$ is the background noise energy. A recorded segment of data is classified as silence if $E_s$ is found to be below a certain threshold. Otherwise, the recorded speech is compressed and the compressed information is transmitted to the server.
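The energy bookkeeping behind this decision can be sketched as follows. The block sizes, the threshold value, and the function names are hypothetical; only the 0.1 residual-echo factor and the energy subtraction come from the description above.

```python
def block_energy(samples):
    """Sum of squared samples in one block."""
    return sum(s * s for s in samples)


def background_energy(blocks):
    """Lowest block energy in the observation window (about one second).

    That block is taken to contain background noise only.
    """
    return min(block_energy(b) for b in blocks)


def is_silence(recorded_block, expected_echo_energy, noise_energy,
               threshold, residual_factor=0.1):
    """Classify a block as silence when the estimated speech energy is low.

    residual_factor = 0.1 corresponds to assuming 10 dB of echo cancellation.
    """
    e_a = block_energy(recorded_block)              # recorded energy
    e_e = residual_factor * expected_echo_energy    # residual echo energy
    e_s = e_a - e_e - noise_energy                  # estimated speech energy
    return e_s < threshold


# Usage with made-up numbers.
window = [[1.0, 1.0], [0.5, 0.5], [2.0, 2.0]]
e_b = background_energy(window)
print(is_silence([0.6, 0.6], expected_echo_energy=1.0,
                 noise_energy=e_b, threshold=0.5))
```

The threshold trades clipped speech onsets against wasted bandwidth, so in practice it would be tuned rather than fixed as it is here.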

Software Automatic Gain Control

The PC microphone is very sensitive to the distance between the speaker's mouth and the microphone. An automatic gain control (AGC) module is used to reduce this sensitivity. Most multi-media PCs provide this functionality using a hardware solution. The need for the software AGC arises because use of hardware AGC introduces non-linearity in the data which can affect the performance of echo cancellation. The software AGC is implemented as follows:

1. Initially set currentGain = 1.0.

2. Once every 60 msecs compute the root mean squared (rms) value using the relation

$$rms = \sqrt{\frac{1}{N} \sum_{n=0}^{N-1} x_n^2}$$

where $x_n$ is echo-free (i.e., echo removed) non-silent input speech data.

3. Compute the target AGC gain, targetGain, using the relation

$$targetGain = \min\{targetGain,\ 4096 / rms\}$$

4. Set n = 0 and do the following N times:

x_n ← currentGain * x_n
currentGain ← 0.95 * currentGain + 0.05 * targetGain
n ← n + 1
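The AGC steps can be sketched as below. Several details are assumptions because the source equations are damaged: the rms is taken as the usual root mean square over one block, the target gain is taken as 4096 divided by that rms, and the gain ceiling `MAX_GAIN` is a hypothetical value that the source does not give.

```python
import math

TARGET_LEVEL = 4096.0   # level appearing in the target-gain relation
MAX_GAIN = 8.0          # hypothetical ceiling; not stated in the source


def agc_block(samples, current_gain, max_gain=MAX_GAIN):
    """Process one block (about 60 ms); return (scaled samples, new gain)."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0.0:
        target_gain = max_gain
    else:
        target_gain = min(max_gain, TARGET_LEVEL / rms)
    out = []
    for s in samples:
        out.append(current_gain * s)
        # First-order smoothing moves the gain gradually toward the target.
        current_gain = 0.95 * current_gain + 0.05 * target_gain
    return out, current_gain


# Usage: 480 samples (60 ms at 8 kHz) of a quiet, constant-level input.
block = [1024.0] * 480
scaled, gain = agc_block(block, current_gain=1.0)
print(scaled[0], round(gain, 6))
```

Because the gain is updated per sample with a 0.95/0.05 smoother, it converges toward the target within a few dozen samples without an abrupt level jump.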

An Efficient Speech Coding Scheme for Voice Bridging Application

In a multi-point voice conferencing (i.e., voice bridging) system, each participant speaks into his microphone connected to his multi-media computer or work station. The computer performs data compression to enable efficient data transmission using a modem to the server. The server collects the speech packets from each participant. Using the knowledge about the position of the participants and the direction of their faces, the server decides which speakers (a maximum of two), if any, should be heard by each listener. In doing so, the server should be able to combine bit-streams from two speakers into one bit-stream, so that this bit-stream can be transmitted on the modem. The multimedia system at the client side will use this bit-stream to synthesize the voice of two participants, combine them and generate a composite signal which is played out through the speakers along with the ambient sound. The unique feature of the speech coding technique presented here is that it is designed to work as a dual rate codec. That is, the most important parameters of the speech coder output are coded at a rate of 6,400 bits per second. Additional information, which can be used to improve the quality of synthesized speech, is coded using another 6,400 bits per second. This way, the coder can function as a high quality speech compression system at 12,800 bits per second or as a communication quality speech compression system at 6,400 bits per second. When the voice data of two speakers needs to be transmitted, the server allocates 6,400 bits per second (i.e., half-rate) for each speaker's voice. However, when only one person's speech information needs to be transmitted to a client, the full bandwidth is allocated for one speaker's voice data, thus permitting higher quality speech synthesis. The ability to take a 12,800 bits per second bit-stream and convert it easily into a half-rate bit-stream (i.e., a 6,400 bits per second bit-stream) can also be exploited when more control or graphics information has to be transmitted along with the speech data from the client to the server.

The algorithm allows for the client to make the decision for the server as to which of the arbitrary number of voice streams that could be sent to the client are actually chosen. Since the client knows the location of each of the speakers, it can choose to take into account the distance between the listener and speakers as well as the direction of sight of all avatars, objects that might be in the 'path of the sound,' and even allow for different sound mediums (i.e., underwater, on a windy hill). The priority decision process is continually re-evaluated, and when this re-evaluation occurs fast enough, it can approximate having many more concurrent streams between server and client than actually exist.

The algorithm allows each client to make such a decision independently of the others, hence allowing a client to provide the best audio experience for the user, taking into account the machine and bandwidth limitations of that particular client.
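The dual-rate allocation can be illustrated with a small sketch. The priority rule used here (nearest speakers first) is an assumption chosen for brevity; the description above also allows facing direction, obstacles, and sound medium to influence the choice. All names are hypothetical.

```python
import math

FULL_RATE_BPS = 12_800   # one speaker heard: full-rate stream
HALF_RATE_BPS = 6_400    # two speakers heard: half-rate stream each


def pick_speakers(listener_pos, speakers, max_speakers=2):
    """Choose up to max_speakers, nearest first (assumed priority rule).

    speakers maps a speaker name to an (x, y) position.
    """
    ranked = sorted(speakers,
                    key=lambda name: math.dist(listener_pos, speakers[name]))
    return ranked[:max_speakers]


def allocate_rates(chosen):
    """Split the 12,800 bits-per-second budget among the chosen speakers."""
    if not chosen:
        return {}
    rate = FULL_RATE_BPS if len(chosen) == 1 else HALF_RATE_BPS
    return {name: rate for name in chosen}


# Usage: three candidate speakers around a listener at the origin.
positions = {"ann": (1.0, 0.0), "bob": (5.0, 5.0), "cat": (0.0, 2.0)}
chosen = pick_speakers((0.0, 0.0), positions)
print(allocate_rates(chosen))
```

Re-running the selection frequently, as the text suggests, lets the audible pair change as participants move, without ever exceeding the modem budget.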

LPC Analysis 

The speech compression system implemented here is based on the general principle of Linear Prediction Coding (LPC) of speech using multi-pulse excitation signals described in B. S. Atal and J. R. Remde, "A new model for LPC excitation for producing natural sounding speech at low bit-rates," Proc. Int. Conf. on Acoustics, Speech and Signal Processing, Paris, France, 1982, pp. 614-617. The speech signal, $S_n$, is modeled as

[equation illegible in source]

that is, as the output of a digital time-varying infinite impulse response filter excited by a sequence of irregularly spaced pulses of different amplitudes, where $a_k$, 0 < k ≤ M, are the M-th order LPC filter coefficients. The filter order is typically about 10. $P_n$ represents the multi-pulse sequence, and a further term (symbol illegible in source) represents the random noise sequence.
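A multi-pulse LPC synthesis loop can be sketched as follows. The recursion and its sign convention, s(n) = a_1 s(n-1) + ... + a_M s(n-M) + p(n), are assumptions made for illustration because the model equation is not available here; the noise term is omitted and the names are hypothetical.

```python
def synthesize(lpc, pulses, n_samples):
    """Run an all-pole filter driven by sparse pulses.

    lpc    : coefficients a_1..a_M (assumed sign convention, see lead-in)
    pulses : dict mapping sample index -> pulse amplitude
    """
    out = []
    for n in range(n_samples):
        acc = pulses.get(n, 0.0)            # excitation at this sample
        for k, a in enumerate(lpc, start=1):
            if n - k >= 0:
                acc += a * out[n - k]       # feedback from past outputs
        out.append(acc)
    return out


# Usage: first-order filter excited by two irregularly spaced pulses.
print(synthesize([0.5], {0: 1.0, 3: 2.0}, 5))
```

Only the pulse positions and amplitudes plus the filter coefficients need to be transmitted, which is what makes the representation compact.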

The time-varying filter coefficients are obtained using LPC analysis as described in S. Shankar Narayan and J. P. Burg, "Spectral estimation of quasi-periodic data", IEEE Transactions on Acoustics, Speech and Signal Processing, March 1990, pp. 512-518. Prior to LPC analysis, the speech data is pre-emphasized using a first order filter of the form $H(z) = (1 - 0.875 z^{-1})$. The coefficient estimation is performed once every 5 msecs. However, for the purpose of speech coding only every fourth parameter set is used. The higher rate of computation is for the purpose of lip-synch parameter estimation, which will be discussed later. Typically 10-32 pulses are used to represent the multi-pulse excitation function. The synthesized speech quality depends on the number of pulses used (the more the better). However, the amount of data compression achieved also depends on the number of pulses used to represent the excitation function (the fewer the better). In order to transmit this information over the modem, the LPC parameters are quantized using a two-stage vector quantizer followed by a scalar quantizer to generate a 38-bit representation of the LPC filter. This procedure is discussed next.

^ Veotor and Scalar Quanttzal Ion 

1. Convert the LPC filter coefficients to reflection coefficients using the transformation described in J. Markel and A. Gray, Linear Prediction of Speech, Springer-Verlag, 1976. Reflection coefficients are another representation of LPC filter coefficients. The transformation from one set of parameters to the other is lossless and invertible. Let $k_i$, $0 < i \le M$, be the M reflection coefficients corresponding to the computed LPC filter coefficients.

2. Convert the first four reflection coefficients to log-area-ratio (lar) functions using the method described in J. Markel and A. Gray, Linear Prediction of Speech, Springer-Verlag, 1976. Use the log-area-ratio transformation given there, where ln[.] stands for the natural logarithm operation.

3. The 10 parameters $[lar_1, \ldots, lar_4, k_5, \ldots, k_{10}]$ are quantized using a 64-codebook vector quantizer. The procedure for codebook generation and implementation of the vector quantizer is described in Y. Linde, A. Buzo and R. M. Gray, "An Algorithm for Vector Quantizer Design", IEEE Trans. on Communications, Jan 1980, pp. 84-95; R. M. Gray, "Vector Quantization," IEEE ASSP Magazine, April 1984, pp. 4-29. The vector quantizer accepts the vector $X = [lar_1, \ldots, lar_4, k_5, \ldots, k_{10}]^T$, searches through a codebook with 64 entries, finds the output sequence best matching the input vector in the mean squared sense and outputs a 6-bit codeword index $i_{vq1}$. The decoder can look up the corresponding codeword in its codebook and obtain the sequence $qX_1$ as a 6-bit approximation to the given input vector.

4. A difference vector $dX_1 = X - qX_1$ is formed and is vector quantized again using a 64-codebook vector quantizer, generating codeword index $i_{vq2}$ and the sequence $qX_2$.

5. A difference vector $dX_2 = dX_1 - qX_2$ is formed next. Now each component of this vector is quantized individually (i.e., using scalar quantizers). The numbers of bits used to quantize the components are [4, 3, 3, 3, 3, 2, 2, 2, 2, 2]. This 26-bit information, along with the two 6-bit VQ codes $i_{vq1}$ and $i_{vq2}$, forms the 38-bit representation of the LPC filter, and this information will be a part of the packet of information to be transmitted as encoded speech information. Let the quantized values of the vector $dX_2$ be denoted $qX_3$; then the decoded value of the vector $X$ is $X' = qX_1 + qX_2 + qX_3$.

6. From this $X'$ vector, the quantized LPC filter coefficients can be obtained by performing the needed transformations (i.e., log-area-ratio to reflection coefficients and reflection coefficients to LPC filter coefficients). The sequence $a'_k$, $0 < k \le M$, is used to designate the resulting filter coefficients.
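The quantization steps above can be sketched as a two-stage vector quantizer followed by per-component scalar quantization of the remaining residual. This is a minimal sketch under stated assumptions: the codebooks `cb1` and `cb2` are hypothetical inputs (the text trains them with the cited codebook design method), and the uniform scalar quantizer with range `[lo, hi]` is an illustrative choice, since the text gives only the bit allocation.

```python
SCALAR_BITS = [4, 3, 3, 3, 3, 2, 2, 2, 2, 2]  # bit allocation from the text (26 bits)


def nearest(codebook, vec):
    """Index of the codeword with minimum squared error to vec."""
    def err(cw):
        return sum((c - v) ** 2 for c, v in zip(cw, vec))
    return min(range(len(codebook)), key=lambda i: err(codebook[i]))


def scalar_quantize(value, bits, lo, hi):
    """Uniform quantizer on [lo, hi] (assumed); returns the integer code."""
    levels = 1 << bits
    step = (hi - lo) / levels
    code = int((value - lo) / step)
    return max(0, min(levels - 1, code))


def encode(x, cb1, cb2, lo=-0.25, hi=0.25):
    """Return (first VQ index, second VQ index, ten scalar codes)."""
    i1 = nearest(cb1, x)
    d1 = [a - b for a, b in zip(x, cb1[i1])]   # first-stage residual
    i2 = nearest(cb2, d1)
    d2 = [a - b for a, b in zip(d1, cb2[i2])]  # second-stage residual
    codes = [scalar_quantize(v, b, lo, hi) for v, b in zip(d2, SCALAR_BITS)]
    return i1, i2, codes


def decode(i1, i2, codes, cb1, cb2, lo=-0.25, hi=0.25):
    """Rebuild the vector as the sum of the three quantized stages."""
    out = []
    for n, (code, bits) in enumerate(zip(codes, SCALAR_BITS)):
        step = (hi - lo) / (1 << bits)
        q3 = lo + (code + 0.5) * step
        out.append(cb1[i1][n] + cb2[i2][n] + q3)
    return out


def total_bits():
    """Two 6-bit VQ indices plus the scalar codes."""
    return 6 + 6 + sum(SCALAR_BITS)
```

With 64-entry codebooks each index fits in 6 bits, so `total_bits()` gives the 38 bits stated in the text.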

Generation of multi-pulse excitation parameters

Using an analysis-by-synthesis approach, the multi-pulse excitation parameters (i.e., pulse positions and amplitudes) are obtained as follows. Let $s_n$ represent the speech data in a 20 msec frame and let $a_k$, $0 < k \le M$, be the LPC filter coefficients obtained for this frame of speech. Compute the residual signal, $r_n$, from the speech data and the LPC filter coefficients. Compute the impulse response, $h_n$, of the perceptual weighting filter over $L$ samples, where $\lambda = 0.85$ and $L$ is chosen as 40.

Next form two sequences, $g_n$ and $q_n$. The first is defined for $0 \le n < 2L-1$; the second is defined piecewise over the ranges $0 \le n < L-1$, $L-1 \le n < N-L$ and $N-L \le n < N+L-1$. In other words, the sequence $g_n$ is obtained by convolving the sequence $h_n$ with itself, and the sequence $q_n$ is obtained by convolving the residual sequence, $r_n$, with $h_n$.

The location of an excitation pulse is chosen to be the value of $n$ for which $q_n$ is maximum. Let this location be $l_i$; the height of the pulse at that location is then computed. After each pulse's information (i.e., location and height) is obtained, the sequence $q_n$ is modified to account for that pulse.

This sequential procedure is continued until the desired number of excitation pulses is determined. Using the pulse locations thus obtained, the optimal height information is then obtained by solving the matrix equation $S_{gg} h = S_{gq}$, where $S_{gg}$ is an (M x M) matrix whose (i,j)-th element is taken from the sequence $g_n$ according to the locations of the i-th and j-th pulses, $S_{gq}$ is an (M x 1) vector whose i-th element is taken from the sequence $q_n$ at the i-th pulse location, and $h$ is an (M x 1) vector of excitation impulse heights.
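The sequential pulse search can be illustrated with a generic greedy procedure of the kind used in multi-pulse coders. The sketch below is an assumption-laden illustration, not the exact relations of the text (which did not survive scanning): it correlates a target signal with an impulse response `h`, picks the location with the largest correlation magnitude, sets the height to the correlation divided by the energy of `h`, and subtracts that pulse's contribution before searching again. All function names are hypothetical, and the joint height optimization described above is not included.

```python
def correlate_with_response(target, h, n_positions):
    """c[n] = sum_k target[n + k] * h[k] for each candidate pulse location n."""
    c = []
    for n in range(n_positions):
        acc = 0.0
        for k, hk in enumerate(h):
            if n + k < len(target):
                acc += target[n + k] * hk
        c.append(acc)
    return c


def autocorrelation(h):
    """r[k] = sum_m h[m] * h[m + k] for lags 0 .. len(h) - 1."""
    size = len(h)
    return [sum(h[m] * h[m + k] for m in range(size - k)) for k in range(size)]


def greedy_pulse_search(target, h, n_positions, n_pulses):
    """Return a list of (location, height) pairs found one pulse at a time."""
    c = correlate_with_response(target, h, n_positions)
    r = autocorrelation(h)
    pulses = []
    used = set()
    for _ in range(n_pulses):
        candidates = [n for n in range(n_positions) if n not in used]
        loc = max(candidates, key=lambda n: abs(c[n]))
        height = c[loc] / r[0]
        pulses.append((loc, height))
        used.add(loc)
        # Remove this pulse's contribution from the correlation sequence.
        for n in range(n_positions):
            lag = abs(n - loc)
            if lag < len(r):
                c[n] -= height * r[lag]
    return pulses
```

Excluding already used locations reflects the constraint, stated later in the text, that no two pulses share a location.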

In an embodiment of the present system, 12 excitation pulses are first determined sequentially for every 20 msec of speech data and then the pulse heights are optimized. 38 bits of LPC coefficient information along with the information about these 12 excitation pulses (i.e., location and amplitude) can be used as transmission parameters of a speech coding system with a bit rate of 6,400 bits per second. The contribution from these 12 pulses is subtracted from the sequence $q_n$ and an additional 14 excitation pulses are obtained using the same approach. In essence, 26 multi-pulse excitation pulses are determined in two stages. 38 bits of LPC coefficient information along with the information about these 26 excitation pulses (i.e., location and amplitude) can be used as transmission parameters of a speech coding system with a bit rate of 11,600 bits per second. In the voice bridging application, the client system (i.e., multimedia computer station), in addition to sending the server this 11,600 bits per second bit stream, also sends additional information (a 1,200 bits per second bit stream) as to which 12 pulses should be chosen amongst these 26 pulses if the server wants to generate 6,400 bits per second bit stream data. Thus the bandwidth required to send speech information to the server is 12,800 bits per second, while the server sends the compressed speech data to the client either at 11,600 bits per second (in the case of one speaker) or 12,800 (2 x 6,400) bits per second (in the case of two speakers). The encoding of the excitation pulse information is now described.

Pulse Position Coder

Positions of 26 excitation pulses computed for each 20 msec segment of speech have to be coded efficiently in order to accomplish low bit rate speech coding. 20 msec of speech corresponds to 160 samples of speech. Each pulse can be in one of the 160 locations, but no two pulses can have the same location. The combinatorial coding scheme presented in M. Berouti, et al., "Efficient Computation and Encoding of the multi-pulse excitation for LPC," Proc. Int. Conf. on Acoustics, Speech and Signal Processing, San Diego, CA, 1984, pp. 10.1.1-10.1.4, is used to code this information using 102 bits. The encoder uses the combinatorial coding scheme to code the information needed by the server to select 12 pulses out of these 26 pulses ($^{26}C_{12}$ combinations, or 24 bits) in order to generate the half-rate bit stream. The same strategy is used in coding the positions of 12 pulses in the case of the half-rate coder. Thus a total of 114 bits are needed to code the pulse location information for every 20 msecs.
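To make the idea of combinatorial coding concrete, the sketch below ranks a set of distinct positions with the combinatorial number system, a standard enumerative scheme. It is offered as an assumption about the general technique only; the cited paper's exact procedure may differ, and the function names are hypothetical.

```python
from math import comb


def bits_needed(n, k):
    """Bits required to index every k-subset of n positions."""
    return (comb(n, k) - 1).bit_length()


def encode_positions(positions):
    """Rank a set of distinct positions in the combinatorial number system."""
    return sum(comb(p, i + 1) for i, p in enumerate(sorted(positions)))


def decode_positions(rank, k):
    """Invert encode_positions for a subset of size k."""
    positions = []
    for i in range(k, 0, -1):
        p = i - 1
        # Find the largest p with comb(p, i) <= rank.
        while comb(p + 1, i) <= rank:
            p += 1
        positions.append(p)
        rank -= comb(p, i)
    return sorted(positions)
```

For example, `bits_needed(26, 12)` evaluates to 24, which agrees with the 24 bits quoted above for selecting 12 pulses out of 26.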

Pulse Amplitude Coder 

Amplitudes of 26 excitation pulses computed for each 20 msec segment of speech have to be coded efficiently in order to accomplish low bit rate speech coding. The pulse amplitudes are coded efficiently by normalizing them by the root-mean-square (rms) pulse amplitude. After scaling, the pulse amplitudes are quantized using an 8-level Gaussian quantizer described in J. Max, "Quantizing for minimum distortion," IRE Trans. on Information Theory, vol. IT-6, 1960, pp. 7-12. The rms pulse amplitude, which is optimized to minimize the quantization noise, is coded as a transmission parameter using 6 bits. Thus a total of 84 bits (26 x 3 + 6) are needed to code the pulse amplitude information for every 20 msecs.
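A minimal sketch of the amplitude coding idea follows: normalize by the rms value, then map each normalized amplitude to one of eight levels. The thresholds and output levels are the commonly tabulated 8-level minimum-distortion values for a unit-variance Gaussian and should be checked against the cited paper; the 6-bit coding of the rms value and its optimization are not shown. Function names are hypothetical.

```python
import math

# Commonly tabulated 8-level values for a unit-variance Gaussian (assumed).
THRESHOLDS = [-1.748, -1.050, -0.5006, 0.0, 0.5006, 1.050, 1.748]
LEVELS = [-2.152, -1.344, -0.7560, -0.2451, 0.2451, 0.7560, 1.344, 2.152]


def encode_amplitudes(amps):
    """Return (rms, list of 3-bit codes) for a list of pulse amplitudes."""
    rms = math.sqrt(sum(a * a for a in amps) / len(amps))
    codes = []
    for a in amps:
        x = a / rms if rms > 0 else 0.0
        # The code is the number of thresholds at or below x (0..7).
        codes.append(sum(1 for t in THRESHOLDS if x >= t))
    return rms, codes


def decode_amplitudes(rms, codes):
    """Map each code back to its output level and undo the normalization."""
    return [rms * LEVELS[c] for c in codes]
```

With 26 pulses this uses 26 x 3 = 78 bits for the codes, and the 6-bit rms value brings the total to the 84 bits stated above.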

An Efficient Lip Synching Method

We have observed that the lip positions in human beings can be determined to a large extent by the first two peaks, known as formant frequencies, of the short-term spectra of the speech, and the short-term energy of the speech signal. Specifically, the separation between the first two formant frequencies is proportional to the width of the lip. The lower lip height tends to be proportional to the value of the first formant frequency. Finally, the upper lip height and the lip rounding phenomenon (i.e., both the lips moving away from the closed mouth position) are inversely proportional to the sum of the first two formant frequencies. In other words, both the first and second formant frequencies have to reduce to produce lip rounding. These estimates may not be identical to the real lip positions of the speaker, but when used for facial animation they are expected to provide very realistic effects. The method for computing lip positions comprises the following steps:

Lip Information Extractor

1. The LPC parameters estimated for the purpose of speech encoding (as discussed earlier) can be employed to obtain short-term spectral information in the speech data. However, the computational requirement for estimating the formant frequencies from the LPC filter information would be high. In one embodiment of the present invention, the formant frequencies corresponding to each entry of the first 64-codeword VQ codebook are pre-computed in non-real-time and stored. Therefore, when the first stage vector quantization is performed on the LPC parameters, an estimate of the formant frequencies is also obtained. The symbols $f_1$ and $f_2$ are used as the first two formant frequencies for a given segment of speech. Furthermore, the symbol $E$ is used as the signal energy for the frame in decibels (dB). Given the two formant frequencies and the signal energy, and assuming a nominal lip width of size 1 unit, the following heuristics are used to get preliminary estimates of the lip positions:

2. Filtering of signal energy information: The signal energy, $E$, computed for a frame of speech includes the background noise energy also. The effect of the background noise level should be eliminated before using this information for the calculation of lip positions. The following algorithm is used to modify the computed signal energy:
Initially set

AvcmgeS^naiLoY^hO Onc© every 5 msecs^ update this using the relation, 

A\/&ragGSignaIL&v^hQ.QSB*AYefsgeSigfiaa.ev^^4Q.00^ *E1\)9 signal energy, £, Is updated as 
5^^'(Av&mg&S^nafLevef-2a^ Set the value of £to 0 If fese than 0 and equal to 40 d3, if greater than 40. 

3. Lower Lip Height Computation: If la in the range (300-900 Hz), compute lower lip height using the relation 
foworUpMefghULS -coB^^/,-250)/500>t(£«C9 Olhenvlse, lower lip height is computed using the relatioai 
iowe/UfiHeight ~E/200 

4. Lip Wkfth Computation: The heuristics here is not to change the lip width if is in the region 1000-1800 Hz 
range. If is in the rang© 700-1000 Hz nange. the iip width is decreased using the relation 

ttpWkith= 4cos(n*(fa-76o)/300)]*^l 33 If is In the range 1S0O-250O Hz range, the lip width Is increased 
using the relation 

ifpWWf/j=1-j{1-KiOe(ii*(4-1B0Oy7O0)r£i^0O 

5. Lip Rounding: "mis is found to happen when + j^<1600 , 2S0< f^< 800 and 700< J^< ^0 . Then 
uppsfUpHeight=^ .2 *E11 . 1 -Kxtsfz^fi ^^eoo>/B00]y40 and the lower lip height is modified using th e equa- 
tion 

hw^rLipHefght=hwedJpHeigM■ta.S''F'l^,^^H^<^^n*{f^+^^ -flOO)/SOO}J/40 If lip rounding does not occur, the 
upper lip height depends mildly on the signal energy and Is calculated as 
upperUpHeigfiit=BAOO. 

Lip Filtering 

A lip position smoothing filter is used in the present embodiment. The instantaneous lip positions so obtained every 5 msecs (or 200 times a second) tend to be noisy. Furthermore, most visual displays are refreshed at a much lower rate (typically 6 to 30 times per second). A 31-point finite impulse response (FIR) low-pass filter with a cut-off frequency of 6.25 Hz is applied to each of the lip parameters to obtain smooth estimates of the parameters at the desired rate. In the current implementation, the smoothed lip parameters are computed 16.67 times per second (i.e., once every 60 msecs).
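The smoothing stage can be sketched as a 31-tap low-pass FIR filter running on the 200-per-second lip estimates, with every 12th output kept to give 200 / 12 = 16.67 values per second. The Hamming-windowed sinc design below is an assumption; the text specifies only the filter length and the cut-off frequency. Function names are hypothetical.

```python
import math


def lowpass_fir(num_taps=31, cutoff_hz=6.25, sample_rate_hz=200.0):
    """Hamming-windowed sinc low-pass filter, normalized to unit DC gain."""
    fc = cutoff_hz / sample_rate_hz  # cutoff in cycles per sample
    mid = (num_taps - 1) / 2.0
    taps = []
    for n in range(num_taps):
        t = n - mid
        if t == 0:
            ideal = 2.0 * fc
        else:
            ideal = math.sin(2.0 * math.pi * fc * t) / (math.pi * t)
        window = 0.54 - 0.46 * math.cos(2.0 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [tap / total for tap in taps]


def smooth_and_decimate(values, taps, step=12):
    """Filter values sampled 200 times a second and keep every step-th output."""
    out = []
    for end in range(len(taps) - 1, len(values), step):
        acc = 0.0
        for k, tap in enumerate(taps):
            acc += tap * values[end - k]
        out.append(acc)
    return out
```

Each lip parameter (lower lip height, upper lip height, lip width) would be passed through the same filter independently.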

Lip Information Encoding

The lip position variables are highly correlated. For example, when the mouth is widened, it is likely that the lip heights tend to be small. On the other hand, when the mouth is rounded, the lip width is small and the lip heights are large. This information is exploited in coding the lip position using a very small number of bits. In the present invention, 8 bits per 60 msec are used to encode all the lip position variables. The lower lip height information is coded using a 16-level quantizer and it can take the following values:

Code    Lower Lip Height
0       0.005
1       0.070
2       0.141
3       0.199
4       0.252
5       0.306
6       0.377
7       0.458
8       0.554
9       0.653
10      0.752
11      0.876
12      1.034
13      1.213
14      1.468
15      1.825

As the upper lip height information and the lip width are highly correlated, they are jointly quantized using a 16-level quantizer and can take the following values:

Code    5*UpperLipHeight    LipWidth
0       0.017               1.002
1       0.053               1.005
2       0.189               0.933
3       0.089               1.002
4       0.532               0.853
5       0.347               0.943
6       0.743               0.778
7       0.839               0.723
8       0.048               1.040
9       0.076               1.038
10      0.055               1.082
11      0.101               1.081
12      0.065               1.120
13      0.074               1.155
14      0.032               1.187
15      0.093               1.225

Thus the lip position information is coded using only 8 bits.
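The two quantizer tables above can be combined into a single byte, four bits for the lower lip height and four bits for the joint upper-lip-height and lip-width code. The sketch below is illustrative only: the nearest-neighbour search, the squared-error distance, and the nibble layout are assumptions, and the last lower-lip level is read as 1.825 because the scanned table is unclear at that entry.

```python
LOWER_LIP_LEVELS = [
    0.005, 0.070, 0.141, 0.199, 0.252, 0.306, 0.377, 0.458,
    0.554, 0.653, 0.752, 0.876, 1.034, 1.213, 1.468,
    1.825,  # assumed reading of an unclear table entry
]

# (5 * upperLipHeight, lipWidth) pairs from the joint quantizer table.
JOINT_LEVELS = [
    (0.017, 1.002), (0.053, 1.005), (0.189, 0.933), (0.089, 1.002),
    (0.532, 0.853), (0.347, 0.943), (0.743, 0.778), (0.839, 0.723),
    (0.048, 1.040), (0.076, 1.038), (0.055, 1.082), (0.101, 1.081),
    (0.065, 1.120), (0.074, 1.155), (0.032, 1.187), (0.093, 1.225),
]


def encode_lip(lower, upper, width):
    """Pack the lip position into one byte (high nibble: lower lip code)."""
    lower_code = min(range(16), key=lambda i: abs(LOWER_LIP_LEVELS[i] - lower))
    scaled_upper = 5.0 * upper

    def dist(i):
        u, w = JOINT_LEVELS[i]
        return (u - scaled_upper) ** 2 + (w - width) ** 2

    joint_code = min(range(16), key=dist)
    return (lower_code << 4) | joint_code


def decode_lip(byte):
    """Return (lower lip height, upper lip height, lip width)."""
    lower = LOWER_LIP_LEVELS[byte >> 4]
    scaled_upper, width = JOINT_LEVELS[byte & 0x0F]
    return lower, scaled_upper / 5.0, width
```

Two 4-bit codes account for the 8 bits of lip information carried in each voice packet.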

Robot/Whisper Excitation Generator

As pointed out earlier, the speech signal can be modeled as the output of a time-varying digital filter (i.e., LPC filter) excited by either a random noise sequence or an impulse pulse train. Typically, when the spoken sound is a fricative (like the sounds s, SH, f), the filter is excited by random noise. On the other hand, when vowels are spoken, the filter is excited by a quasi-periodic signal with a period corresponding to the pitch of the speaker. The filter excitation signal can be altered to accomplish the task of voice morphing. In the present invention, morphing involves modifying the excitation function parameters to produce the specific type of voice disguising and modification.

Whisper speech is produced when there is no vocal cord vibration, which translates to no periodic pulse excitation in the case of LPC synthesizers (i.e., using only random noise for exciting the LPC filter). If it is desirable to generate a whisper, the excitation signal is changed to random noise. The energy of the random noise is adjusted so as to be proportional to the actual energy of the speech.

Robotic speech is produced when the pitch of the human speaker tends towards monotone (i.e., the pitch changes very little during the talk). In the context of LPC synthesis, this translates to using a periodic pulse excitation function whose period changes very little. The energy of these impulses should be adjusted such that the energy of the synthesized speech is equal to that of the original speech. The periodicity of the robotic speech is specified by the user. To reduce the buzz in the synthesized speech, a low frequency jitter is added to the periodicity information in the case of robotic speech.
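The two excitation substitutions can be sketched as follows. This is a simplified illustration under stated assumptions: energy is matched at the excitation rather than at the synthesized speech, the jitter is a small random change of the period per pulse rather than a true low-frequency modulation, and all names and parameters are hypothetical.

```python
import math
import random


def energy(x):
    """Sum of squared samples."""
    return sum(v * v for v in x)


def whisper_excitation(original, rng):
    """Replace the excitation with Gaussian noise of the same energy."""
    noise = [rng.gauss(0.0, 1.0) for _ in original]
    noise_energy = energy(noise)
    scale = math.sqrt(energy(original) / noise_energy) if noise_energy > 0 else 0.0
    return [scale * v for v in noise]


def robot_excitation(original, period, jitter=0, rng=None):
    """Replace the excitation with a near-periodic pulse train of equal energy."""
    locations = []
    pos = 0
    while pos < len(original):
        locations.append(pos)
        delta = rng.randint(-jitter, jitter) if (rng is not None and jitter > 0) else 0
        pos += max(1, period + delta)
    out = [0.0] * len(original)
    if locations:
        amp = math.sqrt(energy(original) / len(locations))
        for p in locations:
            out[p] = amp
    return out
```

A call such as `robot_excitation(frame, period=80, jitter=2, rng=random.Random(1))` would give a monotone pulse train at roughly 100 pulses per second for 8,000 samples per second input.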

Voice Morpher

Another type of voice morphing implemented in the system is that of altering the pitch of the speaker. This way, a male voice can be made to sound more like a female, and vice versa. The voice morphing is accomplished in two stages. Suppose the pitch frequency is to be increased by a factor r. First, the sampling frequency of the speech data is decreased by the same pitch change factor r using a speech interpolation/decimation technique. This also changes the duration of the speech samples (i.e., decreases the speech duration by a factor of r). In order to keep the duration of the pitch-altered speech the same as that of the original speech, time scale modification of the speech is performed using the technique described in Werner Verhelst and Marc Roelands, "An overlap-add technique based on waveform similarity (WSOLA) for high quality time-scale modification of speech," Proc. Int. Conf. on Acoustics, Speech and Signal Processing, Minneapolis, 1993, pp. II-554-II-557. This type of voice morphing is performed on the speech data prior to performing speech analysis at the encoder.
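The first stage of the pitch change can be illustrated with a simple resampler. The sketch below uses linear interpolation, which is an assumption chosen for brevity (a practical system would use a proper interpolation/decimation filter), and it does not implement the time-scale modification stage. The function name is hypothetical.

```python
def resample_linear(samples, r):
    """Read the input r times faster, producing about len(samples) / r samples.

    Played back at the original sampling rate, the result has its pitch
    raised by the factor r and its duration shortened by the same factor.
    """
    n_out = int(len(samples) / r)
    out = []
    for i in range(n_out):
        pos = i * r
        j = int(pos)
        frac = pos - j
        nxt = samples[j + 1] if j + 1 < len(samples) else samples[j]
        out.append((1.0 - frac) * samples[j] + frac * nxt)
    return out
```

For r = 1.25, a 160-sample frame becomes 128 samples, which is the shortening that the time-scale modification stage then has to undo.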

Bitstream Encoder

The bitstream encoder accepts as input the various pieces of encoding information produced by the analyzer and packs them into a 96-byte packet, to be sent to the server once every 60 msecs. It should however be noted that the speech analysis is performed once every 20 msecs. Therefore, the bitstream encoder uses three sets of analysis data to make one voice packet. The following set of parameters is computed every 20 msec and included in each voice packet:

1. VQ codeword $i_{vq1}$ (6 bits)
2. VQ codeword $i_{vq2}$ (6 bits)
3. 10 residual LPC filter coefficients (26 bits)
4. rms pulse amplitude code (6 bits)
5. 26 pulse amplitude codes (78 bits)
6. pulse location code (114 bits)
7. Lip position information (8 bits per 60 msecs)
8. Speech type code (i.e., whether it is whisper, normal speech or robotic sound).
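As a quick consistency check on the packet layout listed above, the following arithmetic sketch converts the packet size and interval into a bit rate and counts the bits whose sizes are given in the list. The dictionary keys are illustrative names; the size of the speech type code is not stated in the text, so it is treated here as part of the unaccounted remainder.

```python
PACKET_BYTES = 96
PACKET_INTERVAL_MS = 60
FRAME_MS = 20

# Per-frame field sizes in bits, as listed in the text.
PER_FRAME_BITS = {
    "vq_index_1": 6,
    "vq_index_2": 6,
    "residual_lpc": 26,
    "rms_amplitude": 6,
    "pulse_amplitudes": 78,
    "pulse_locations": 114,
}
LIP_BITS_PER_PACKET = 8


def packet_bit_rate():
    """Bits per second implied by one packet per interval."""
    return PACKET_BYTES * 8 * 1000 // PACKET_INTERVAL_MS


def listed_bits_per_packet():
    """Bits accounted for by the listed fields in one packet."""
    frames = PACKET_INTERVAL_MS // FRAME_MS
    return frames * sum(PER_FRAME_BITS.values()) + LIP_BITS_PER_PACKET


def spare_bits_per_packet():
    """Bits left for the speech type code and anything not itemized."""
    return PACKET_BYTES * 8 - listed_bits_per_packet()
```

The packet rate works out to 12,800 bits per second, matching the upstream bandwidth quoted earlier for sending speech information to the server.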

Decoder Implementation

The bitstream decoder accepts as input the voice packets sent from the server and decodes them into various parameters to be used by the decoder to implement the LPC synthesizer function to get synthesized speech. The synthesizer is implemented as a difference equation in which $a'_k$ are the decoded LPC filter coefficients of order M, $s'_n$ are the synthesized speech samples and $p'_n$ are the decoded excitation pulses.
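An all-pole synthesis filter of this kind can be sketched as below. The sign convention, in which each output sample is the excitation plus a weighted sum of previous outputs, is an assumption; the difference equation itself did not survive scanning. The function name is hypothetical.

```python
def lpc_synthesize(coeffs, excitation):
    """Compute s[n] = p[n] + sum_{k=1..M} a[k] * s[n - k] (assumed convention)."""
    out = []
    for n, p in enumerate(excitation):
        acc = p
        for k, a in enumerate(coeffs, start=1):
            if n - k >= 0:
                acc += a * out[n - k]
        out.append(acc)
    return out
```

With a single coefficient of 0.5 and a unit impulse as excitation, the output decays as 1, 0.5, 0.25, and so on, which is the impulse response of the filter.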
Adaptive Post Filter, Distance and Echo Filter, and Bandpass Filter

Adaptive post-filtering is discussed in J. H. Chen and A. Gersho, "Real-Time Vector APC Speech Coding at 4800 bps with adaptive post-filtering," Proc. Int. Conf. on Acoustics, Speech and Signal Processing, Dallas, 1987, pp. 2185-2188. This filtering is applied to the synthesized speech to further improve the speech quality. This system can create special sound effects to simulate the virtual space setting. For this purpose, echo or reverberation filtering can be employed. The echo filter is a first-order infinite impulse response filter with the following system function

$$H(z) = \frac{1}{1 - G z^{-D}}$$

where D is the reverberation delay and G is the reverberation coefficient (which has a magnitude less than 1.0) to create the needed special effect. The bandpass filtering is applied to the output of the echo filter to remove the DC and low-frequency offset noise, and to effect the de-emphasis filtering. The filter transfer function is



H(z}= 



(1-0.875zb(1-§z^) 
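The echo filter described above, with system function $H(z) = 1 / (1 - G z^{-D})$, corresponds to the recursion y[n] = x[n] + G * y[n - D]. A minimal sketch, with a hypothetical function name and parameter names, is:

```python
def echo_filter(samples, delay, gain):
    """Apply y[n] = x[n] + gain * y[n - delay]."""
    if delay < 1:
        raise ValueError("delay must be at least one sample")
    if abs(gain) >= 1.0:
        raise ValueError("gain magnitude must be below 1.0 for stability")
    out = []
    for n, x in enumerate(samples):
        feedback = out[n - delay] if n >= delay else 0.0
        out.append(x + gain * feedback)
    return out
```

Feeding a unit impulse through the filter produces echoes of decreasing size every `delay` samples, which is the reverberation effect the text describes.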



Using the knowledge about the spatial positions of the speaker and the listener in the virtual world, the distance filtering is implemented to convert the mono sound samples into stereo sound samples using the following algorithm:

1. Let $\theta_l$ and $\theta_s$ be the listener and speaker angles in the virtual space and let $d$ be the distance between them (in meters).

2. Then distance gain. is obtained as 

^llr/a tfd>l 

3s 3. The left and right distance filter coefficients are compputed as 

t,=<3^(O.65-O.35*ein(0;- 1 )>*'(0.654O.35+cos(QJ)/(1 ^-0.5*006(0^)) 
^ i^=^*(1.co5(e;y2 

...^,=G^*{O.65-O.35+sin(0^ | )'r(0.65-^0.^5*cos(Q^))/(t.5■^0.S*cos(9^) 
/^=/?,*a-rC0SO;y2 

4. Finally, the left and right channel speech samples are computed using ths following filtering operations, 

so V^'^n-^^^'n-l 

where $x_n$ is the output of the speech synthesizer (mono sound), and $l_n$ and $r_n$ are the resulting left and right channel sound samples.

Ambient Sound Generation

In a virtual socialization environment, background sounds are generated to create special effects. For example, when a participant leaves/enters a discussion group, the rest of the people in the group hear a sound like the opening/closing of a door. It could also be background music in the virtual space. Such sounds are generated by this module.

Wave Mixer 

The sound output through the acoustic speakers can be one of the following:

1. One person's speech only,
2. One person's speech with background sound (music or ambient noise to simulate virtual space),
3. Two persons' speech only, or
4. Two persons' speech with background sound (music or ambient noise to simulate virtual space).

The wave mixer takes as input different sound data streams, applies appropriate user-specified gains, adds them and finally applies soft clipping on the data to ensure high speech quality.
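The mixing step can be sketched as a gain-weighted sum followed by a soft limiter. The text does not say which soft clipping curve is used; the hyperbolic tangent below and the 16-bit limit of 32767 are assumptions, as are the function names.

```python
import math


def soft_clip(x, limit=32767.0):
    """Smoothly compress x so its magnitude never reaches limit (assumed tanh curve)."""
    return limit * math.tanh(x / limit)


def mix(streams, gains, limit=32767.0):
    """Sum the gain-weighted streams sample by sample, then soft clip."""
    length = max(len(s) for s in streams)
    out = []
    for i in range(length):
        total = sum(g * s[i] for s, g in zip(streams, gains) if i < len(s))
        out.append(soft_clip(total, limit))
    return out
```

Small sums pass through almost unchanged, while sums that would overflow the output range are compressed instead of being hard limited.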

The invention now being fully described, it will be apparent to one of ordinary skill in the art that many changes and modifications can be made thereto without departing from the spirit or scope of the invention as set forth herein. Accordingly, the present invention is to be limited solely by the scope of the appended claims.



Claims

1. A system (100) for a plurality of users (104-106) conducting voice and image communication on a wide area network (112), each user being associated with a computer (114), a network access device (122) having a maximum data communication speed for connecting said computer to said network, a microphone (118) and a loudspeaker (119, 120), said microphone generating speech signals in response to audio signals, which are then converted into digital speech data, said system comprising:

a speech server (110) connected to said network for managing data streams sent by user computers associated with said users;

an encoder (200) for running on each one of said user computers, said encoder (200) comprising:
a compressor for compressing said speech data received by said encoder into compressed data, said compressor including means (222) for generating a plurality of linear predictive coding (LPC) parameters; and
a bit stream encoder (232) for encoding said compressed data into an encoded data stream having a data rate below said maximum data communication speed;

said bit stream encoder serving to generate a first encoded data stream having a first data rate from said speech data of a first user computer and to generate a second encoded data stream having a second data rate from said speech data of a second user computer; and

said server including means for combining said first and said second encoded data streams into a combined data stream having a data rate below said maximum data communication speed while said first and said second data rates have a sum above said maximum data communication speed; and
a decoder (300) running on a third user computer, said decoder (300) comprising:
means for receiving said combined data stream; and

means for reconstructing said audio signals received by said microphones associated with said first and said second user computers using information from said combined data stream.

2. A system according to claim 1, wherein said decoder (300) further comprises means for simulating acoustic distances between said first and said third user computers and between said second and said third user computers.

3. A system according to claim 1 or 2, wherein said decoder (300) further comprises means for simulating acoustic angles between said first and said third user computers and between said second and said third user computers.

4. A system according to claim 1 and further comprising:

means for determining acoustic distance and acoustic angles between a selected user computer and a set of said user computers;

means for selecting a subset out of said set of user computers based on said acoustic distance and said acoustic angle; and

means for receiving by said selected user computer only data streams originated from said subset of user computers.

5. A system according to any one of claims 1 to 4, wherein said encoder (200) further comprises a voice morph means (214) for altering said speech signal.

6. A system according to claim 5, wherein the voice morph means serves to shift a pitch of said speech signals of at least one of said first and said second user computers.

7. A system according to claim 6, wherein said voice morph means shifts said pitch by a constant value.

8. A system according to claim 5, wherein said voice morph means comprises means for eliminating periodic components of said speech signals of at least one of said first and said second user computers.

9. A system according to claim 5, wherein said voice morph means comprises:

means for changing a sampling frequency of said speech signals; and
means for changing a time scale of said speech signals as a function of change in said sampling frequency.

10. A system according to any one of claims 1 to 9, wherein said encoder (200) further comprises means for determining a first formant frequency of said speech signals using said LPC parameters;

means for determining a second formant frequency of said speech signals using said LPC parameters; and

each of said user computers further comprising means for displaying a lower lip position and an upper lip position using said first and second formant frequencies.

11. A system according to claim 10, wherein each of said user computers further comprises means for measuring energy of said speech signals, and wherein said means for displaying further including displaying a width of said lips as proportional to said second formant frequency and a height of said lower lip as related to said first formant frequency and said energy.

12. A system according to claim 10 or 11, wherein said means for displaying further comprises means for displaying lip rounding as inversely proportional to a sum of said first and said second formant frequencies.

13. A system according to claim 12, wherein said width is in a correlation relationship with said upper and said lower lip positions and said rounding.

14. A system according to claim 12 or 13, wherein said means for displaying further comprises means for smoothing said width, said height and said rounding using a filter.

15. A system according to claim 14, wherein said filter is a finite impulse response low-pass filter.

16. A system according to claim 11 or claim 12, 13, 14 or 15 when appended to claim 11, wherein said lower lip position is computed from:

If said first formant frequency is in the range of (300-300 i-fz), sard tower tip position (bwerLlpHeight) Is: 



50 



!0W9rLpH3jght = 1.5'COS(r.(fj'250}/500r(E/aO) 
otherwise said lower lip position is:

lowerLipHeight = E/200

where E is said signal energy and $f_1$ is said first formant frequency.

17. A system according to claim 11 or claim 12, 13, 14, 15 or 16 when appended to claim 11, wherein said lip width (lipWidth) is computed from:

if said second formant frequency is in the range 700-1000 Hz, said lip width is decreased;

if said second formant frequency is in the range 1800-2500 Hz, said lip width is increased


lipWtdth= l4{l4C03(n (f^- 180(y?VOrE/2O0; 
where E is said signal energy and $f_2$ is said formant frequency.


18. A system according to any one of the preceding claims and further comprising means for determining a silence state in a surrounding of one of said user computers, said silence state being used by said compressor as an input for compressing said speech data, said means for determining said silence state comprising:

means for generating a first source signal which is substantially a chirp signal and for causing said microphone to play said first source signal as a first audio signal;
means for generating a first digital signal based on said first audio signal received by said loudspeaker;
a filter for processing said first digital signal matched to said chirp signal;
means for determining a bulk delay as a time when said processed first digital signal has a maximum value;
means for generating a second source signal which is substantially a white noise and for causing said microphone to play said second source signal as a second audio signal;
means for generating a second digital signal based on said second audio signal received by said loudspeaker;
means for determining a cross-correlation function of said second source signal and said second digital signal;
means for generating an auto-correlation of said second source signal;
means for determining a finite impulse response as a function of said cross-correlation and said auto-correlation function;
means for determining an echo cancellation energy using said finite impulse response and said bulk delay;
means for measuring acoustic energy received by said microphone; and
means for measuring background noise energy;
3S said surroundings being classffied to be in said silence state when - * Eg is below a predetermined 

value, where is said acoustic energy measured by said microphone, is said echo cancellation ertergv^ 
and Eg Is said background noise energy. 

19. A method for determining acoustic characteristics of a room having a microphone and a loudspeaker, said microphone being connected to a computer through an analog to digital converter and said loudspeaker being connected to said computer through a digital to analog converter, said method comprising the steps of:

generating, by said computer, a first source signal which is substantially a chirp signal;
converting said first source signal to a first audio signal by said digital to analog converter and said microphone;
receiving said first audio signal by said loudspeaker;
converting said received first audio signal by said analog to digital converter to generate a first digital signal;
processing said first digital signal by a filter matched to said chirp signal; and
determining a bulk delay as a time when said processed signal has a maximum value.

20. A method according to claim 19, further comprising the steps of generating, by said computer, a second source signal which is substantially a white noise;

converting said second source signal to a second audio signal by said digital to analog converter and said microphone;
receiving said second audio signal by said loudspeaker;
converting said received second audio signal by said analog to digital converter to generate a second digital signal;
determining a cross-correlation function of said second source signal and said second digital signal; and
determining a finite impulse response as a function of said cross-correlation and an auto-correlation function of said second source signal.

21. A msthod according to clarm ^ or 21 . comprising the step of determining an echo cancellation signal by 

5 

w ' 

where h(k) is said finite impulse response and said $\delta$ is said bulk response.

[Figure: Audio Transmitter Block 130 (reference numerals 132, 134). FIG. 3]

[Figure: Audio Receiver Block 140 (reference numerals 142, 144). Figure label not legible in source.]






(19) Europäisches Patentamt
European Patent Office
Office européen des brevets

(11) EP 0 779 732 A3

(12) EUROPEAN PATENT APPLICATION

(88) Date of publication A3: 10.05.2000 Bulletin 2000/19

(43) Date of publication A2: 18.06.1997 Bulletin 1997/25

(21) Application number: 96203451.8

(22) Date of filing: 06.12.1996

(51) Int Cl.7: H04M 3/56, H04L 12/18, G10L 5/00

(84) Designated Contracting States:
AT BE CH DE DK ES FR GB GR IE IT LI LU MC NL PT SE

(30) Priority: 12.12.1995 US 571058

(71) Applicant: OnLive! Technologies, Inc.
Cupertino, California 95014 (US)

(72) Inventor: Narayan, Shankar S.
Palo Alto, California 94306 (US)

(74) Representative: BROOKES & MARTIN
High Holborn House
52/54 High Holborn
London, WC1V 6SE (GB)

(54) Multi-point voice conferencing system over a wide area network



(57) An interactive network system (100) communicates speech and associated information
among a plurality of participants at different sites (104, 106). An example of the
associated information is lip synch image information related to the speech. The system
contains a speech server (110) for managing data streams sent by the participants. Each
participant uses a multimedia computer (114) and a modem (122) to connect to the network.
Because many modems have a low bit rate, it is important to compress the speech and
associated information. The server (110) receives the data streams from at least two
participants and contains means (200) for combining these data streams into a single data
stream having a bit rate that can be handled by the modem of the third participant. As a
result, a plurality of participants can conduct speech and image communication using the
network.




Printed by Jouve, 75001 PARIS (FR)




EUROPEAN SEARCH REPORT

EP 96 20 3451

DOCUMENTS CONSIDERED TO BE RELEVANT

Citation of document with indication, where appropriate, of relevant passages:

US 5 383 184 A (CHAMPION TERRENCE G)
17 January 1995 (1995-01-17)
* abstract *
* column 1, line 29-23 *
* column 2, line 58-64 *
* column 3, line 4-38 *

US 5 390 177 A (NAHUMI DROR)
14 February 1995 (1995-02-14)
* abstract *
* column 1, line 69-62 *
* column 3, line 31-37 *

PATENT ABSTRACTS OF JAPAN
vol. 016, no. 052 (E-1164),
10 February 1992 (1992-02-10)
& JP 03 252258 A (TOSHIBA CORP),
11 November 1991 (1991-11-11)
* abstract *

PATENT ABSTRACTS OF JAPAN
vol. 1995, no. 07,
31 August 1995 (1995-08-31)
& JP 07 092988 A (MATSUSHITA ELECTRIC IND CO LTD),
7 April 1995 (1995-04-07)
* abstract *

US 5 473 363 A (NG DENNIS ET AL)
5 December 1995 (1995-12-05)
* column 1, line 17-20 *
* column 3, line 23-25 *

Relevant to claim (entries as they appear in the source; their assignment to the
individual citations and the category column are not recoverable):
2,3
2,3
1-4

CLASSIFICATION OF THE APPLICATION:
H04M3/56
H04L12/18
G10L5/00

[line illegible in source]

TECHNICAL FIELDS SEARCHED (Int.Cl.6):
G10L
H04M

Place of search: THE HAGUE
Date of completion of the search: 7 January 2000
Examiner: Quélavoine, R

CATEGORY OF CITED DOCUMENTS
Y : particularly relevant if combined with another document of the same category
A : technological background
O : non-written disclosure
P : intermediate document
T : theory or principle underlying the invention
& : member of the same patent family, corresponding document







European Patent Office

Application Number
EP 96 20 3451

CLAIMS INCURRING FEES

The present European patent application comprised at the time of filing more than ten claims.

□ Only part of the claims have been paid within the prescribed time limit. The present European search
report has been drawn up for the first ten claims and for those claims for which claims fees have
been paid, namely claim(s):

□ No claims fees have been paid within the prescribed time limit. The present European search report has
been drawn up for the first ten claims.

LACK OF UNITY OF INVENTION

The Search Division considers that the present European patent application does not comply with the
requirements of unity of invention and relates to several inventions or groups of inventions, namely:

see sheet B

□ All further search fees have been paid within the fixed time limit. The present European search report has
been drawn up for all claims.

□ As all searchable claims could be searched without effort justifying an additional fee, the Search Division
did not invite payment of any additional fee.

□ Only part of the further search fees have been paid within the fixed time limit. The present European
search report has been drawn up for those parts of the European patent application which relate to the
inventions in respect of which search fees have been paid, namely claims:

None of the further search fees have been paid within the fixed time limit. The present European search
report has been drawn up for those parts of the European patent application which relate to the invention
first mentioned in the claims, namely claims:

1-4




European Patent Office

LACK OF UNITY OF INVENTION
SHEET B

Application Number
EP 96 20 3451

The Search Division considers that the present European patent application does not comply with the
requirements of unity of invention and relates to several inventions or groups of inventions, namely:

1. Claims: 1-4

multimedia conferencing system with a virtual spatial voice
position effect, and means to select the spatial area each
attendant wishes to listen

2. Claims: 5-9

voice morphing effect

3. Claims: 10-17

lips movement simulation in synchronization with audio

4. Claim: 18

echo cancellation

5. Claims: 19-21

adaptive noise/echo measures in a room and cancellation
method






ANNEX TO THE EUROPEAN SEARCH REPORT
ON EUROPEAN PATENT APPLICATION NO. EP 96 20 3451

This annex lists the patent family members relating to the patent documents cited in the above-mentioned European search report.
The members are as contained in the European Patent Office EDP file on 07-01-2000.
The European Patent Office is in no way liable for these particulars which are merely given for the purpose of information.

Patent document            Publication    Patent family       Publication
cited in search report     date           member(s)           date

US 5383184 A               17-01-1995     US 5317567 A        31-05-1994
                                          US 5457685 A        10-10-1995
                                          AU 2321892 A        05-04-1993
                                          WO 9305595 A        18-03-1993
                                          US 5272698 A        21-12-1993

US 5390177 A               14-02-1995     CA 2114868 A,C      25-09-1994
                                          EP 0617537 A        28-09-1994
                                          JP 7095380 A        07-04-1995

JP 03252258 A              11-11-1991     NONE

JP 07092988 A              07-04-1995     NONE

US 5473363 A               05-12-1995     CA 2169571 A        08-02-1996
                                          EP 0724806 A        07-08-1996
                                          WO 9603831 A        08-02-1996

For more details about this annex: see Official Journal of the European Patent Office, No. 12/82






